Introduction to Open Data Science - Course Project
1 About the project
1.1 A short description and a link to the github repository
This is a course that should help me work better with data analysis in the future. I already love markdown :). I have R knowledge, but I never completed a course in the language, so sometimes I find I am doing very simple things the hardest way imaginable. I hope this course will close these gaps in my knowledge.
The crash course is nice, as I’m not that used to the tidyverse syntax and was taught to use base R (subsets, slices and such). I use git for my work, so it’s not very convenient for me to start using PATs. But I checked, and it indeed works; I just prefer to do it through gpg.
Here’s the link to the github repository.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 11 21:55:47 2023"
The text continues here.
2: Regression and model validation
Describe the work you have done this week and summarize your learning.
2.1 Data wrangling.
After understanding the metadata file and skipping the unnecessary Finnish-language instructions, it was easy to prepare the data file. The script can be found here. I used plyr and tidyverse to convert the raw data into a workable file.
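As a minimal sketch of the kind of conversion involved — averaging groups of question columns into one composite score — here is a toy example. The question names (D03, D07, D11) are hypothetical; the real ones are listed in the metadata file.

```r
library(dplyr)

# Hypothetical toy data: three Likert-scale questions per participant
toy <- data.frame(D03 = c(4, 2), D07 = c(5, 3), D11 = c(3, 4))

# A composite score is the row-wise mean of the questions in one dimension
toy <- toy %>% mutate(deep = rowMeans(across(c(D03, D07, D11))))
toy$deep  # one composite value per participant
```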
2.2 Analysis
The dataset is the result of a survey on teaching and learning conducted as a research project in 2014. The dataset is loaded in the next chunk, and its dimensions and structure are printed.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
learning2014_csv <- read_csv("/home/alex/ods_course/IODS-project_R/data/learning2014_AA.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014_csv)
## [1] 166 7
str(learning2014_csv)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The data frame contains 166 rows and 7 columns. The first column encodes gender and has the character type. The rest are all numeric and contain the data that will be used for the regression analysis. The “attitude”, “deep”, “stra”, and “surf” columns contain combinations of results from the original dataframe. The full metadata explanation can be found here. The explanation for the study can be found here.
Briefly:
- The attitude variable is the global attitude toward statistics: a composite score of the questions about overall attitude toward statistics
- The deep variable is the deep approach: an adjusted composite score of the questions connected with deep understanding of statistics
- The stra variable is the strategic approach: an adjusted composite score of the questions about how strategically the participant approaches the subject
- The surf variable is the surface approach: an adjusted composite score of the questions about whether the participant struggles to understand the material deeply and has problems studying
2.2.2 Data exploration
2.2.2.1 Summary for the variables
summaries <- learning2014_csv %>%
summarise(across(where(is.numeric), list(mean = ~mean(.), sd = ~sd(.), min = ~min(.), max = ~max(.), median = ~median(.))))
print(t(summaries))
## [,1]
## age_mean 25.5120482
## age_sd 7.7660785
## age_min 17.0000000
## age_max 55.0000000
## age_median 22.0000000
## attitude_mean 3.1427711
## attitude_sd 0.7299079
## attitude_min 1.4000000
## attitude_max 5.0000000
## attitude_median 3.2000000
## deep_mean 3.6797189
## deep_sd 0.5541369
## deep_min 1.5833333
## deep_max 4.9166667
## deep_median 3.6666667
## stra_mean 3.1212349
## stra_sd 0.7718318
## stra_min 1.2500000
## stra_max 5.0000000
## stra_median 3.1875000
## surf_mean 2.7871486
## surf_sd 0.5288405
## surf_min 1.5833333
## surf_max 4.3333333
## surf_median 2.8333333
## points_mean 22.7168675
## points_sd 5.8948836
## points_min 7.0000000
## points_max 33.0000000
## points_median 23.0000000
2.2.2.2 Graphical data summary
ggpairs(learning2014_csv, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
Description
- The gender distribution is uneven
- The age is also skewed toward younger participants, with only a few older participants of each gender
- No other variable has a stark clustering behaviour
- There are significant correlations between some variables:
  - Negative:
    - surf/deep: these are expected to be inversely correlated, as are surf/attitude and surf/stra
    - age/surf appears to be negatively correlated, but the p-value is between 0.05 and 0.1, as it is for surf/points
  - Positive:
    - attitude/points, stra/points

These results make sense at first glance.
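The pairwise correlations (and the p-values behind the significance markers ggpairs prints) can be checked explicitly with cor.test(). A sketch on simulated vectors; the same call applies to, e.g., learning2014_csv$surf and learning2014_csv$deep.

```r
# Simulated data with a built-in negative association
set.seed(1)
x <- rnorm(100)
y <- -0.4 * x + rnorm(100)

ct <- cor.test(x, y)  # Pearson correlation test
ct$estimate           # the correlation coefficient
ct$p.value            # its significance
```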
2.2.3 Linear regression
I have chosen three variables to check: attitude, age and surf.
linear_modelling_for_learning2014_csv <- lm(points ~ surf + attitude + age, data = learning2014_csv)
summary(linear_modelling_for_learning2014_csv)
##
## Call:
## lm(formula = points ~ surf + attitude + age, data = learning2014_csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.973 -3.430 0.167 3.997 10.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.85628 3.53786 4.765 4.18e-06 ***
## surf -0.96049 0.79943 -1.201 0.231
## attitude 3.42388 0.57353 5.970 1.45e-08 ***
## age -0.08713 0.05361 -1.625 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared: 0.2082, Adjusted R-squared: 0.1935
## F-statistic: 14.2 on 3 and 162 DF, p-value: 2.921e-08
Only one explanatory variable showed a statistically significant connection to the points outcome, so only the attitude variable will be kept for the next modelling step.
linear_modelling_for_learning2014_csv_pruned <- lm(points ~ attitude, data = learning2014_csv)
summary(linear_modelling_for_learning2014_csv_pruned)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014_csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The second model is simpler but has not lost much explanatory power:
- The p-value is comparably low, much lower than 0.001, meaning there is strong support for rejecting the hypothesis that the two variables have no connection.
- The estimate shows how the two variables are connected: for a one-unit change in attitude, points change by 3.5255.

The adjusted R-squared of 0.1856 is relatively low, indicating that the model may not be very effective at predicting or explaining the dependent variable. This might be due to missing important predictors, non-linear relationships, or high levels of noise in the data.
2.2.3.1 Linear model diagnostic plots
plot(linear_modelling_for_learning2014_csv_pruned, which=c(1,2,5))
Using these plots we can investigate the assumptions of the model.
- We can use the “Residuals vs Fitted” plot to investigate the constant-variance assumption. If the variance is constant, we should expect to see points distributed without noticeable structure, i.e. randomly. This is more or less what we see in our data. Although the points around 28 on the x axis seem bunched up, this may be the result of low n-numbers, so we can assume constant variance.
- The Q-Q plot can be used to check whether the “normal distribution of residuals” assumption is met. The middle part of the plot follows the 45-degree line, meaning that the distribution is close to normal for the bulk of the data. The tails do deviate from the line, meaning there is some departure from perfect normality. At the bottom left there are 3 points that can be considered outliers. Further investigation is needed to determine whether these points should be excluded from further analysis.
- The Residuals vs Leverage plot shows whether the data contains influential outliers. Three points are highlighted: 71, 56 and 35. These should be investigated to determine whether the data is fine; for example, there could be a data-entry mistake, a missing-value problem, or a true outlier. Points 56 and 35 appear in both the Q-Q and the Residuals vs Leverage plots, so the next step would be checking the data to see what happened to these points.
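A sketch of that follow-up step: drop the suspected points and refit, then compare R-squared. Shown here on simulated data with one planted outlier; the same pattern applies to rows 35, 56 and 71 of learning2014_csv.

```r
# Simulated regression data with one planted influential outlier
set.seed(42)
x <- rnorm(50)
y <- 3 * x + rnorm(50)
y[50] <- y[50] - 20               # plant an outlier in the last observation

full  <- lm(y ~ x)                # fit with the outlier included
refit <- lm(y[-50] ~ x[-50])      # refit without the suspect point

summary(full)$r.squared
summary(refit)$r.squared          # higher once the outlier is removed
```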
3: Logistic regression
Necessary packages
if (!require(tidyverse)) {
install.packages("tidyverse")
}
library(tidyverse)
if (!require(gt)) {
install.packages("gt")
}
## Loading required package: gt
library(gt)
3.1 Data wrangling
The dataframes were merged and modified according to the instructions. The resulting dataframe contains 370 observations with 35 columns. The R script used is here.
3.2 Read the file and describe the data a bit
aa_alc <- read_csv("/home/alex/ods_course/IODS-project_R/data/aa_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(aa_alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime : num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities : chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ average_alc: num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. average_alc = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(aa_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "average_alc" "high_use"
The data show student achievement in secondary education at two Portuguese schools; there are 370 observations of 35 variables. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. The different features can be used to analyse the dataset and make predictions. More detailed information can be found here.
“school”: The name or type of school the student attends.
“sex”: The student’s gender.
“age”: The student’s age.
“address”: The student’s home address or type of locality (urban/rural).
“famsize”: The size of the student’s family.
“Pstatus”: The marital status of the student’s parents.
“Medu”: The mother’s education level.
“Fedu”: The father’s education level.
“Mjob”: The mother’s job.
“Fjob”: The father’s job.
“reason”: The reason for choosing this school.
“guardian”: The student’s primary guardian.
“traveltime”: Time taken to travel to school.
“studytime”: Time spent on studying outside of school.
“schoolsup”: Whether the student receives school support.
“famsup”: Whether the student receives family support.
“activities”: Participation in extracurricular activities.
“nursery”: Whether the student attended nursery school.
“higher”: Aspirations for higher education.
“internet”: Access to the internet at home.
“romantic”: Involvement in a romantic relationship.
“famrel”: Quality of family relationships.
“freetime”: Amount of free time after school.
“goout”: Frequency of going out with friends.
“Dalc”: Daily alcohol consumption.
“Walc”: Weekly alcohol consumption.
“health”: General health status.
“failures”: Number of past class failures.
“paid”: Whether the student is enrolled in extra paid classes.
“absences”: Number of school absences.
“G1”: Grade in the first period.
“G2”: Grade in the second period.
“G3”: Final grade.
“average_alc”: Average alcohol consumption, computed during data wrangling from Dalc and Walc.
“high_use”: Indicator of high alcohol use, derived from average_alc during data wrangling.
3.3 Choosing the variable
I decided to look at “failures”, “absences”, “freetime” and “famrel”. I would expect that:
- failures might be positively associated with higher consumption
- absences might also be positively associated with higher consumption
- freetime may be associated with weekly consumption, but it is possible that the association will be inverse
- family relations may be inversely correlated
3.4 Testing the assumptions:
summaries <- aa_alc %>% group_by(high_use) %>% select(.,c("failures", "absences", "freetime", "famrel","high_use")) %>%
summarise(across(where(is.numeric), list(Average = ~mean(.), sd = ~sd(.))))
gt(summaries) %>% cols_align("left") %>% opt_stylize(color="gray", style=3) %>% tab_header("Table for chosen variables")
| Table for chosen variables | ||||||||
| high_use | failures_Average | failures_sd | absences_Average | absences_sd | freetime_Average | freetime_sd | famrel_Average | famrel_sd |
|---|---|---|---|---|---|---|---|---|
| FALSE | 0.1196911 | 0.4370868 | 3.710425 | 4.464017 | 3.119691 | 0.9869096 | 4.007722 | 0.8935266 |
| TRUE | 0.3513514 | 0.7464906 | 6.378378 | 7.060845 | 3.468468 | 0.9421428 | 3.765766 | 0.9337603 |
These are the average and standard-deviation values for all the variables, to better see which variables are worth a closer look.
First, we look at the overall structure to see whether something connects the chosen variables.
ggpairs(select(aa_alc,c("failures", "absences", "freetime", "famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
As it turns out, free time is strongly correlated with family relations, which I did not expect.
Failures
ggpairs(select(aa_alc,c("failures","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The number of failures is slightly higher in the high-consumption group, but not by much. The overwhelming majority of participants in each group have 0 failures. This was expected.
Abscences
ggpairs(select(aa_alc,c("absences","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The number of absences in the high use groups is higher, as expected. The standard deviation is also much higher in this group, suggesting more variability in the absences.
Freetime
ggpairs(select(aa_alc,c("freetime","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The freetime variable has a different mean but very similar sd: the high-consumption group has a slightly higher average, which I find interesting.
Family relation
ggpairs(select(aa_alc,c("famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()
The family relation variable seems to be lower in the high consumption group, as expected.
3.5 Using logistic regression
first_log_regression <- glm(high_use ~ failures + absences + freetime + famrel, data = aa_alc, family = "binomial")
# create model summary and extract the coeffs
summary(first_log_regression)
##
## Call:
## glm(formula = high_use ~ failures + absences + freetime + famrel,
## family = "binomial", data = aa_alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55560 0.63703 -2.442 0.014608 *
## failures 0.58297 0.20259 2.878 0.004007 **
## absences 0.08375 0.02230 3.755 0.000173 ***
## freetime 0.43091 0.12804 3.365 0.000764 ***
## famrel -0.31981 0.13098 -2.442 0.014622 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 408.17 on 365 degrees of freedom
## AIC: 418.17
##
## Number of Fisher Scoring iterations: 4
From the model we can see the following:
- Higher values of the failures, absences and freetime predictors are associated with a higher likelihood of high_use being TRUE; for family relations the opposite is true.
- All predictors appear to be statistically significant, as indicated by their p-values and significance codes
- The reduction in deviance from null to residual suggests that the model with predictors fits better than the null model (the one without any predictor variables).
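The deviance reduction can also be summarised as McFadden’s pseudo-R-squared (my own addition, not part of the course output), using the deviances printed in the summary above:

```r
# McFadden's pseudo-R^2 = 1 - residual deviance / null deviance,
# with the values reported by summary(first_log_regression)
null_deviance     <- 452.04
residual_deviance <- 408.17

pseudo_r2 <- 1 - residual_deviance / null_deviance
round(pseudo_r2, 3)  # about 0.097: modest, but a real improvement over the null model
```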
coeffs <- summary(first_log_regression) %>% coefficients()
coeffs
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55560199 0.63703172 -2.441954 0.0146080178
## failures 0.58296988 0.20258843 2.877607 0.0040070412
## absences 0.08375499 0.02230317 3.755295 0.0001731376
## freetime 0.43090906 0.12804024 3.365419 0.0007642749
## famrel -0.31981102 0.13098388 -2.441606 0.0146221015
OddsRatio <- exp(coeffs)
confidence <- confint(first_log_regression)
## Waiting for profiling to be done...
result <- cbind(OddsRatio, confidence)
result
## Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## (Intercept) 0.2110623 1.890860 0.08699073 1.014715 -2.8317664 -0.32620631
## failures 1.7913506 1.224568 17.77169287 1.004015 0.1917587 0.99258448
## absences 1.0873624 1.022554 42.74681512 1.000173 0.0420769 0.13014453
## freetime 1.5386556 1.136599 28.94562433 1.000765 0.1840339 0.68715409
## famrel 0.7262863 1.139949 0.08702100 1.014730 -0.5789180 -0.06341917
Here we can see the same: the first three predictors have a positive association with consumption, while the family-relations variable has a negative one. The strongest connection is with failures, then freetime and famrel, while the link to absences is weak.
It is important to keep in mind that this is logistic regression, not the previously investigated linear regression. That means the estimates in this table represent “odds ratios” and should be thought of in terms of odds: in our case, for example, an increase in failures by one unit multiplies the odds of high alcohol consumption by about 1.8.
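One caveat about the table above: the 2.5 %/97.5 % columns come straight from confint() and are still on the log-odds scale, while the Estimate column is exponentiated. A consistent odds-ratio table exponentiates the coefficients and the confidence intervals together. A sketch on a toy logistic fit; the same two lines apply to first_log_regression.

```r
# Toy logistic regression with a known positive effect of x
set.seed(7)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

# Exponentiate coefficients AND profile CIs in one step
or_table <- exp(cbind(OR = coef(fit), confint(fit)))
or_table  # odds ratios with matching 95 % confidence intervals
```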
The results are in agreement with the hypotheses I started with. Interestingly there seems to be a connection between the free time and the consumption levels, which were not obvious when just looking at the data.
3.6 Exploring the predictive power
We can test the predictive power of the model we created earlier to see if it can actually be used to predict alcohol consumption from the four chosen variables. We can predict, for each student, the probability of increased consumption.
aa_alc$predicted_probabilities <- predict(first_log_regression, type = "response")
aa_alc <- mutate(aa_alc,prediction = predicted_probabilities > 0.5)
table(Actual = aa_alc$high_use, Predicted = aa_alc$prediction)
## Predicted
## Actual FALSE TRUE
## FALSE 236 23
## TRUE 85 26
These are the 2x2 cross tabulation count results.
((table(Actual_percentage = aa_alc$high_use, Predicted_percentage = aa_alc$prediction))/length(aa_alc$prediction)) * 100
## Predicted_percentage
## Actual_percentage FALSE TRUE
## FALSE 63.783784 6.216216
## TRUE 22.972973 7.027027
The FALSE outcome was correctly predicted for 236 participants (63%); TRUE was correctly predicted for 26 (7%).
library(ggplot2)
ggplot(aa_alc, aes(x=high_use, y=prediction)) +
geom_jitter(color="blue") +
theme_minimal() +
labs(title="Actual vs Predicted Alcohol Consumption")
confusion_matrix <- table(Predicted = aa_alc$prediction,Actual = aa_alc$high_use)
training_error <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
training_error
## [1] 0.2918919
The total training error was 29%.
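The overall error hides an asymmetry visible in the confusion matrix above: the model rarely catches the high users. This can be quantified as sensitivity and specificity, sketched here directly from the printed counts (236/23/85/26).

```r
# Confusion matrix from the cross-tabulation above (rows = actual,
# columns = predicted); R fills matrices column-by-column
cm <- matrix(c(236, 85, 23, 26), nrow = 2,
             dimnames = list(Actual = c("FALSE", "TRUE"),
                             Predicted = c("FALSE", "TRUE")))

sensitivity <- cm["TRUE", "TRUE"]   / sum(cm["TRUE", ])   # 26/111, about 0.23
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # 236/259, about 0.91
c(sensitivity = sensitivity, specificity = specificity)
```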
3.6.2 Comparing to simple guessing strategy
We can compare our model to a simple guessing strategy - always predicting the most common class
First we determine which is the bigger group:
sum(aa_alc$high_use)/length(aa_alc$high_use)
## [1] 0.3
30% of the participants are high consumers, so the more prevalent group is the low-consuming one.
simple_guess_accuracy <- mean(aa_alc$high_use == F)
model_accuracy <- mean(aa_alc$high_use == aa_alc$prediction)
simple_guess_accuracy
## [1] 0.7
model_accuracy
## [1] 0.7081081
The accuracy of the model is marginally better than guessing.
3.7 Cross-validating the model
We can perform a cross-validation of the model, meaning that we test the model’s performance on subsets of the same data to determine how accurately the chosen model works in practice. For that we will use the cv.glm function from the boot library. The idea is to test how well the model predicts the status using ten different subsets of the data in sequence.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
library(boot)
cv <- cv.glm(data = aa_alc, cost = loss_func, glmfit = first_log_regression, K = 10)
cv$delta[1]
## [1] 0.2972973
The cross-validated prediction error of my model is about 0.297, compared to the 0.26, which is worse. I used 4 predictors, and I did not include the sex predictor in my model, which might be important. Additionally, two of my chosen factors were quite strongly correlated, so one of them might not add much to the model.
3.8 Using all the predictors
We can try using all the predictors to see how the number of predictors influences the model parameters. For this we exclude all the predictors that were used to create the outcome variable.
First we load additional libraries
if (!require(caret)) {
install.packages("caret")
}
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(caret)
if (!require(glmnet)) {
install.packages("glmnet")
}
## Loading required package: glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(glmnet)
predictors <- setdiff(colnames(aa_alc), c("Dalc","Walc","average_alc","high_use","predicted_probabilities","prediction" ))
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()
aa_alc$high_use <- as.factor(aa_alc$high_use)
# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
# Select a subset of predictors
current_predictors <- predictors[1:i]
formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
set.seed(123) # for reproducibility
fitControl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
# Store the cross-validated error as 1 - Accuracy (model$results has a
# single row for glm, so mean() and max() give the same value)
training_errors <- c(training_errors, 1 - mean(model$results$Accuracy))
testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))
number_of_predictors <- c(number_of_predictors, i)
model_summaries[[i]] <- model$coefnames
}
results_df <- data.frame(
Number_of_Predictors = number_of_predictors,
Training_Error = training_errors,
Testing_Error = testing_errors
)
# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
geom_line(aes(y = Training_Error, colour = "Training Error")) +
geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
labs(title = "Training and Testing Errors by Number of Predictors",
x = "Number of Predictors",
y = "Error") +
theme_minimal()
This shows that simply adding predictors makes the model worse; however, there is an increase in model performance at the 26th point in the plot. If we look at the predictor first present at the 26th point:
setdiff(model_summaries[[26]],model_summaries[[25]])
## [1] "failures"
As we know, failures was a strong predictor. So, to check how the plot looks if we start with the better predictors, I reversed the predictors vector.
predictors <- rev(predictors)
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()
# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
# Select a subset of predictors
current_predictors <- predictors[1:i]
formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
set.seed(123) # for reproducibility
fitControl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
# Store the cross-validated error as 1 - Accuracy (model$results has a
# single row for glm, so mean() and max() give the same value)
training_errors <- c(training_errors, 1 - mean(model$results$Accuracy))
testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))
number_of_predictors <- c(number_of_predictors, i)
model_summaries[[i]] <- model$coefnames
}
results_df <- data.frame(
Number_of_Predictors = number_of_predictors,
Training_Error = training_errors,
Testing_Error = testing_errors
)
# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
geom_line(aes(y = Training_Error, colour = "Training Error")) +
geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
labs(title = "Training and Testing Errors by Number of Predictors",
x = "Number of Predictors",
y = "Error") +
theme_minimal()
That makes the model better overall, with the minimum value being:
## [1] 0.2454599
This minimum comes with the addition of the guardian variable:
predictors[20]
## [1] "guardian"
This shows that simply increasing the number of predictors can negatively influence the model, but adding valuable predictors improves it.
4 Clustering and classification
4.1.1 Loading necessary libraries
library(MASS) # For the Boston dataset
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2) # For graphical representations
library(caret) # For data splitting and preprocessing
library(corrplot)
## corrplot 0.92 loaded
4.1.2 Loading the data and investigating the data structure
# Step 1: Load and Explore the Boston Dataset
data("Boston")
# Exploring the structure and dimensions of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This dataset contains housing values in the suburbs of Boston. It has 506 rows and 14 columns, all numerical; chas can be considered categorical, as it can only be 0 or 1. Each row represents a different suburb, and the columns are various features such as crime rate, number of rooms, age of the housing, etc. A more complete description can be found here.
The variables have the following descriptions:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000’s
4.1.3 Graphical Overview and Summary of Variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary statistics of the variables are not directly comparable, as is expected for real-world data. We can now examine the relations between the variables to look deeper into the data.
The distributions are skewed: nox and dis are skewed towards low values, age towards high values. The proportion of non-retail business acres per town (indus) has a bimodal distribution, as does the tax variable.
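The direction of these skews can be checked numerically; a quick sketch using a simple moment-based sample skewness (the `skew` helper is my own, not from a package):

```r
library(MASS)  # for the Boston dataset

# Simple moment-based sample skewness: positive = tail to the right
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

data("Boston")
skew(Boston$nox)  # positive: mass at low values, tail to the right
skew(Boston$dis)  # positive as well
skew(Boston$age)  # negative: mass at high values
```

The signs agree with the histograms: nox and dis pile up at low values, age at high values.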
Almost all the variables appear to be correlated. Only 8 pairs are not significantly correlated, all involving the "categorical" Charles River variable. Most of the correlations make sense: e.g. nox concentration and the distance to the employment centres are negatively correlated, while nox and industrialisation are positively correlated.
Crime rate has a statistically significant correlation with all variables except chas: it correlates negatively with 5 variables and positively with 7.
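These counts can be verified with pairwise correlation tests; a sketch using `cor.test` over the remaining columns:

```r
library(MASS)  # for the Boston dataset
data("Boston")

# p-values of crim's correlation with every other variable
other_vars <- setdiff(names(Boston), "crim")
p_values <- sapply(other_vars,
                   function(v) cor.test(Boston$crim, Boston[[v]])$p.value)

# chas should be the only variable not significantly correlated with crim
names(p_values)[p_values > 0.05]

# signs of the significant correlations: 5 negative, 7 positive
r <- sapply(other_vars, function(v) cor(Boston$crim, Boston[[v]]))
table(sign(r[p_values < 0.05]))
```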
We can plot the correlations in a more visually pleasing way; all the relations between the variables can be seen clearly in this figure.
cor_matrix <- cor(Boston)
corrplot(cor_matrix)
4.1.4 Standardising the dataset
As the variables describe very different phenomena, their values are not directly comparable: e.g. the mean of tax is 408, while that of nox is ~0.55. To eliminate this, we can standardise the data.
scaled_Boston <- as.data.frame(scale(Boston))
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Now all the means are 0 (and all standard deviations are 1). This standardisation makes the variables comparable and often improves machine learning model performance.
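For reference, `scale()` simply subtracts each column's mean and divides by its standard deviation; the equivalence is easy to verify by hand on one column:

```r
library(MASS)  # for the Boston dataset
data("Boston")

# Manual standardisation of the tax column
manual <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)

# The same column from scale() applied to the whole data set
scaled <- as.numeric(scale(Boston)[, "tax"])

all.equal(manual, scaled)  # TRUE
```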
4.1.4.1 Creating a categorical variable from scaled crime rate
Using quantiles as breakpoints, and removing the old crime rate variable.
quantiles <- quantile(scaled_Boston[, "crim"], probs = c(0, 0.25, 0.5, 0.75, 1))
scaled_Boston$categorical_crim <- cut(scaled_Boston[, "crim"], breaks = quantiles,
include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_Boston <- scaled_Boston[,-which(names(scaled_Boston) == "crim")]
Splitting the Dataset into Train and Test Sets
80% of the data is now in the training set, and the remaining 20% is in the test set.
set.seed(123) # For reproducibility
indexes <- createDataPartition(scaled_Boston$categorical_crim, p = 0.8, list = FALSE)
train_set <- scaled_Boston[indexes, ]
test_set <- scaled_Boston[-indexes, ]
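`createDataPartition` performs a stratified split, sampling within each class so the class proportions are preserved. A base-R sketch of the same idea on toy data (the toy data frame is my own illustration, not the course data):

```r
set.seed(123)

# Toy data: 100 rows, two equally sized classes
toy <- data.frame(x = rnorm(100),
                  class = rep(c("a", "b"), each = 50))

# Stratified 80/20 split: sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$class),
                           function(i) sample(i, size = 0.8 * length(i))))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

table(train$class)  # 40 of each class
```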
4.1.5 Fitting the linear discriminant analysis on the train set
# Loading necessary library
library(MASS) # For lda function
library(ggplot2) # For creating biplot
# Step 1: Fit the Linear Discriminant Analysis Model
# Fitting LDA with categorical crime rate as target variable
lda_model <- lda(categorical_crim ~ ., data = train_set)
# Step 2: Summary of the LDA Model
lda_model
## Call:
## lda(categorical_crim ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2512315 0.2487685 0.2487685 0.2512315
##
## Group means:
## zn indus chas nox rm age
## low 0.9279155 -0.9002730 -0.19513102 -0.8740789 0.4510628 -0.8867213
## med_low -0.0913696 -0.2911617 -0.03844192 -0.5807849 -0.1223160 -0.3793710
## med_high -0.3836558 0.1931104 0.15646403 0.3930228 0.1140281 0.4230461
## high -0.4872402 1.0149946 -0.07933396 1.0662955 -0.4222547 0.8161999
## dis rad tax ptratio black lstat
## low 0.8428080 -0.6789912 -0.7285408 -0.39200564 0.37567764 -0.77093834
## med_low 0.3597389 -0.5361295 -0.4615109 -0.03982856 0.31663833 -0.16048504
## med_high -0.3650188 -0.4542586 -0.3302709 -0.28998920 0.08434782 -0.06173942
## high -0.8559422 1.6596029 1.5294129 0.80577843 -0.76539336 0.90035435
## medv
## low 0.5324073
## med_low -0.0140094
## med_high 0.1977447
## high -0.7285395
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11143638 0.643005769 -0.9943789
## indus 0.08740548 -0.227273609 0.2772096
## chas 0.01766274 -0.058628846 0.1539449
## nox 0.18483474 -0.897262214 -1.3248989
## rm -0.01332317 -0.033607221 -0.1228014
## age 0.25569474 -0.405630630 -0.1938632
## dis -0.12730753 -0.346775059 0.0996160
## rad 4.07862012 0.908418329 -0.1089458
## tax 0.06001905 0.008407111 0.5321260
## ptratio 0.15504095 0.072786263 -0.3434405
## black -0.10715634 0.034155460 0.1240672
## lstat 0.14772261 -0.131485744 0.4522005
## medv 0.04936251 -0.364822504 -0.2088124
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9637 0.0274 0.0088
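The "Proportion of trace" row can be recomputed from the singular values that `lda` stores in its `svd` component; a sketch using the built-in iris data (my own example, not the course data):

```r
library(MASS)  # for lda

fit <- lda(Species ~ ., data = iris)

# Between-group variance explained by each discriminant
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # matches the "Proportion of trace" line in print(fit)
```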
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# The biplot visualizes the linear discriminants (LD1 and LD2) and shows
# how the observations in the training set are separated based on the
# categorical crime rate.
classes <- as.numeric(train_set$categorical_crim)
plot(lda_model, dimen = 2, col = classes,pch = classes)
lda.arrows(lda_model, myscale = 2)
LD1 is influenced mostly by rad; LD2 is influenced by zn and nox with similar magnitude but opposite direction.
4.1.6 Predicting with the fitted LDA model on the test set
# Step 6: Save Crime Categories and Update Test Set
# Saving the crime categories from the test set
test_crime_categories <- test_set$categorical_crim
# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "categorical_crim")]
# Step 7: LDA Model Prediction
# Fit LDA model on the training set
library(MASS) # LDA is in the MASS package
fit_lda <- lda(categorical_crim ~ ., data = train_set)
# Predicting on the test set
predictions <- predict(fit_lda, newdata = test_set)
Cross-tabulating the predictions against the true categories with table and addmargins:
table(Predicted = predictions$class, Actual = test_crime_categories) %>%
addmargins()
## Actual
## Predicted low med_low med_high high Sum
## low 17 3 1 0 21
## med_low 8 15 4 0 27
## med_high 0 7 17 1 25
## high 0 0 3 24 27
## Sum 25 25 25 25 100
The model is reasonably good, especially at the high and low values, and can be used to differentiate high and low groups. The middle groups become muddled.
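Overall accuracy is the trace of the confusion matrix divided by the total count; a quick sketch using the counts printed above:

```r
# Confusion matrix as printed above (rows = predicted, cols = actual)
conf <- matrix(c(17,  3,  1,  0,
                  8, 15,  4,  0,
                  0,  7, 17,  1,
                  0,  0,  3, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(Predicted = c("low", "med_low", "med_high", "high"),
                               Actual    = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 0.73
```

So 73 of the 100 test-set suburbs are classified correctly, with most errors in the two middle categories.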
4.1.7 K-means clustering
Reloading the dataset, scaling and calculating the distances:
data(Boston)
Boston_scaled <- data.frame(scale(Boston))
dist_euc <- dist(Boston_scaled)
summary(dist_euc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
The mean distance is 4.9. To determine the optimal number of clusters we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. We can perform clustering with different values of k and watch what happens to the WCSS as k increases: where the value drops sharply, that k is good. But the larger the k, the harder the results may be to interpret. So we check and plot k from 1 to 20, just to see what happens.
k_max <- 20
twcs <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})
elbow_data <- data.frame(
k = 1:k_max,
twcs = twcs
)
elbow_plot <- ggplot(elbow_data, aes(x = k, y = twcs)) +
geom_line() + # Connect points with lines
geom_point() + # Add points
scale_x_continuous(breaks = 1:k_max) + # Ensure k values are treated as continuous
labs(x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares",
title = "Determining Optimal k") +
theme_minimal()
elbow_plot
The slope of the plot changes sharply at k = 2, so we use this clustering for the further analysis.
Boston_scaled_km <- kmeans(Boston_scaled, 2)
pairs(Boston_scaled,col = Boston_scaled_km$cluster)
pairs(Boston_scaled[,c(1,10)],col = Boston_scaled_km$cluster)
There seems to be good cluster separation between crime and tax.
pairs(Boston_scaled[,c(3,5)],col = Boston_scaled_km$cluster)
An even better separation for indus and nox: one cluster with high indus and high nox, and one with both values low.
pairs(Boston_scaled[,c(2,3)],col = Boston_scaled_km$cluster)
There also seems to be good separation for the indus and zn variables, with two clusters: low indus with high zn, and vice versa. As we saw earlier, the chosen variables are highly correlated, so these results make sense.
4.1.8* Bonus. Visualising clustering results with a biplot:
The elbow plot shows that the last point with a clear decrease in WCSS is 6, so I decided to look at what k = 6 clustering looks like, redoing the LDA with the clusters as the target classes.
set.seed(8) # For reproducibility
data("Boston")
Boston_scaled <- as.data.frame(scale(Boston))
Boston_scaled_km <- kmeans(Boston_scaled, centers = 6)
lda.fit <- lda(Boston_scaled_km$cluster ~ . , data = Boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(Boston_scaled_km$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
As we can see, cluster 3 is separate from all the others, and the Charles River variable is the main determinant for this cluster. Cluster 5 is also separate from the others; this separation depends on the radial highways variable.
4.1.9* Bonus. Visualising the training data used to fit the LDA:
Installing/loading plotly
if (!require(plotly)) {
install.packages("plotly")
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(plotly)
lda.fit <- lda(categorical_crim ~ . , data = train_set)
model_predictors <- dplyr::select(train_set, -categorical_crim)
# check the dimensions
dim(model_predictors)
## [1] 406 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train_set$categorical_crim, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = factor(Boston_scaled_km$cluster[indexes]), type= 'scatter3d', mode='markers')
The crime colouring shows that the high-crime group clusters separately, together with some med_high observations. The k-means colouring shows clusters 3 and 4 together, as in the 2D plane; however, the 3rd group is split between the two clusters.
4.2 Data wrangling for the next week’s data!
The R script transforming the data for next week's assignment is in the repository, with the human.csv file in the same directory.
5: Dimensionality reduction techniques
5.1 Data wrangling
The data wrangling was continued in this script. These are the variable descriptions; more detailed information can be found here: https://hdr.undp.org/data-center/documentation-and-downloads (first and second tables):
- HDI_rank: “HDI Rank” in the first table, indicating the country’s global ranking based on the Human Development Index.
- Country: The name of the country being evaluated, present in both tables.
- HDI: “Human Development Index (HDI) Value” in the first table, measuring average achievement in key human development dimensions.
- LifeExp_B: “Life Expectancy at Birth (years)” from the first table, showing the average number of years a newborn is expected to live under current mortality rates.
- EduExp: “Expected Years of Schooling (years)” in the first table, estimating total years of schooling a child is expected to receive.
- EduMean: “Mean Years of Schooling (years)” in the first table, representing the average years of education received by people aged 25 and older.
- GNI_C: “Gross National Income (GNI) Per Capita (2017 PPP $)” from the first table, indicating the average income of a country’s citizens, adjusted to purchasing power parity.
- GNI_C_HDIrank: “GNI Per Capita Rank Minus HDI Rank” from the first table, showing the difference between the country’s GNI per capita rank and its HDI rank.
- GII_Rank: The ranking of the country in the Gender Inequality Index, part of the second table.
- GII: “Gender Inequality Index” from the second table, measuring gender disparities in reproductive health, empowerment, and economic activity.
- MMRatio: “Maternal Mortality Ratio” from the second table, indicating the number of maternal deaths per 100,000 live births.
- AdBRate: “Adolescent Birth Rate” from the second table, referring to the number of births per 1,000 women aged 15-19.
- RP_p: “Share of Seats in Parliament” in the second table, representing the percentage of parliamentary seats held by women.
- F_2ED_p and M_2ED_p: The percentage of the female and male population, respectively, with at least some secondary education, as indicated in the second table.
- F_LFPR and M_LFPR: The “Labor Force Participation Rate” for females and males, respectively, from the second table.
- FM_2ED_Ratio: A metric comparing the ratio of females to males with at least some secondary education.
- FM_LFPR_Ratio: A metric comparing the ratio of female to male labor force participation rates.
5.2 Analysis
Loading additional libraries
if (!require(FactoMineR)) {
install.packages("FactoMineR")
}
## Loading required package: FactoMineR
library(FactoMineR)
5.2.1 Graphical overview and description of the data
human_data_ <- read.csv(file="/home/alex/ods_course/IODS-project_R/data/human.csv")
human_data_ <- column_to_rownames(human_data_,var ="Country")
Graphical summary:
ggpairs(human_data_, progress = F)
- All the distributions are skewed, meaning extreme values are more common for some variables: life expectancy is skewed towards high values, GNI_C towards low values.
- A lot of variables are strongly correlated, both positively and negatively:
- Life expectancy, education levels, GNI are positively correlated with good significance, and are negatively correlated with “Maternal Mortality Ratio” and “Adolescent Birth Rate” with similar significance. This is to be expected.
- Maternal mortality is correlated with adolescent births, which is to be expected.
- Higher secondary education in females is positively correlated with GNI and life expectancy.
These results are not surprising, as a lot of the indexes are associated with overall wealth, even though there may be outliers.
Summary:
summary(human_data_)
## FM_2ED_Ratio FM_LFPR_Ratio EduExp LifeExp_B
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7254 1st Qu.:0.5981 1st Qu.:11.22 1st Qu.:66.42
## Median :0.9327 Median :0.7514 Median :13.45 Median :74.00
## Mean :0.8499 Mean :0.7034 Mean :13.13 Mean :71.58
## 3rd Qu.:0.9958 3rd Qu.:0.8525 3rd Qu.:15.20 3rd Qu.:76.95
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI_C MMRatio AdBRate RP_p
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4495 1st Qu.: 13.0 1st Qu.: 13.57 1st Qu.:12.50
## Median : 12081 Median : 55.0 Median : 35.00 Median :19.30
## Mean : 17344 Mean : 150.3 Mean : 47.35 Mean :20.88
## 3rd Qu.: 23112 3rd Qu.: 190.0 3rd Qu.: 71.78 3rd Qu.:27.55
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The data is not normalised, so the variables are not directly comparable.
5.2.2 Performing principal component analysis (PCA) on the raw (non-standardized) human data
We perform the PCA analysis for non-standardised data to see why it is important:
human_data_PCA_raw <- prcomp(human_data_)
summary(human_data_PCA_raw)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.821e+04 183.5476 24.84 11.23 3.696 1.539 0.1913 0.1568
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
Most of the variance is captured by the first component.
biplot(human_data_PCA_raw, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The GNI_C is mostly contributing to the PC1. That is because GNI_C has values up to 123124, which is a lot more than any other variable. The nations with high GNI_C are to the left.
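This effect is easy to reproduce with a toy example (my own, not the course data): when one variable's variance dwarfs the others, unscaled PCA simply points PC1 along that variable.

```r
set.seed(1)

# Two uncorrelated variables on very different scales
toy <- data.frame(small = rnorm(200, sd = 1),
                  large = rnorm(200, sd = 1000))

pca_raw <- prcomp(toy)  # no scaling

# PC1's loading is almost entirely on the large-scale variable
round(pca_raw$rotation, 3)
```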
5.2.3 Standardising the variables in the human data and repeat the above analysis
As we saw earlier, the variables are too dissimilar in scale to use for PCA directly, hence the scaling:
human_data_scaled <- scale(human_data_)
summary(human_data_scaled)
## FM_2ED_Ratio FM_LFPR_Ratio EduExp LifeExp_B
## Min. :-2.8446 Min. :-2.5987 Min. :-2.7620 Min. :-2.7457
## 1st Qu.:-0.5220 1st Qu.:-0.5286 1st Qu.:-0.6816 1st Qu.:-0.6272
## Median : 0.3476 Median : 0.2408 Median : 0.1131 Median : 0.2937
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6123 3rd Qu.: 0.7484 3rd Qu.: 0.7381 3rd Qu.: 0.6524
## Max. : 2.7134 Max. : 1.6796 Max. : 2.5239 Max. : 1.4487
## GNI_C MMRatio AdBRate RP_p
## Min. :-0.9206 Min. :-0.7127 Min. :-1.1509 Min. :-1.8531
## 1st Qu.:-0.7057 1st Qu.:-0.6554 1st Qu.:-0.8315 1st Qu.:-0.7435
## Median :-0.2891 Median :-0.4549 Median :-0.3041 Median :-0.1398
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3167 3rd Qu.: 0.1896 3rd Qu.: 0.6012 3rd Qu.: 0.5925
## Max. : 5.8094 Max. : 4.5338 Max. : 3.8760 Max. : 3.2512
human_data_PCA_scaled <- prcomp(human_data_scaled)
summary(human_data_PCA_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0705 1.1430 0.87523 0.77704 0.66349 0.53451 0.45617
## Proportion of Variance 0.5359 0.1633 0.09575 0.07547 0.05503 0.03571 0.02601
## Cumulative Proportion 0.5359 0.6992 0.79496 0.87043 0.92546 0.96117 0.98718
## PC8
## Standard deviation 0.32021
## Proportion of Variance 0.01282
## Cumulative Proportion 1.00000
human_data_PCA_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0705291 1.1430478 0.8752283 0.7770364 0.6634878 0.5345100 0.4561665
## [8] 0.3202124
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## FM_2ED_Ratio -0.35624746 0.05544659 -0.258173727 0.61994816 -0.5923317
## FM_LFPR_Ratio 0.05140236 0.72511638 -0.574905230 0.04778188 0.2842048
## EduExp -0.42844344 0.13794051 -0.068952381 -0.06994677 0.1577789
## LifeExp_B -0.44425251 -0.02871318 0.112160198 -0.05130932 0.1525719
## GNI_C -0.35074145 0.05123343 -0.198230209 -0.73273506 -0.4859385
## MMRatio 0.43708962 0.14765550 -0.123634905 -0.25658826 -0.1736624
## AdBRate 0.41035458 0.08931249 0.008796231 0.05375321 -0.4810245
## RP_p -0.08404578 0.64720641 0.728586153 0.01511136 -0.1500635
## PC6 PC7 PC8
## FM_2ED_Ratio 0.19541082 0.057381277 0.16336822
## FM_LFPR_Ratio -0.03316792 -0.226716792 -0.07410761
## EduExp -0.39597156 0.776784601 -0.05176554
## LifeExp_B -0.42259100 -0.439295194 0.62590822
## GNI_C 0.11842241 -0.138379206 -0.16985528
## MMRatio 0.17148338 0.351654675 0.72304994
## AdBRate -0.74968006 -0.078049728 -0.14549600
## RP_p 0.14102220 0.005543132 -0.02360083
PC1 is still the most important component, but now it explains only around 54% of the variance, PC2 captures around 16%, PC3 less than 10%, and the remaining components show diminishing returns.
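These proportions come directly from the component standard deviations that `prcomp` returns; a sketch with the built-in state.x77 data as a stand-in (since human.csv is a local file):

```r
# Recompute the "Proportion of Variance" row of summary(prcomp(...))
pca <- prcomp(scale(state.x77))  # built-in dataset, scaled like the human data
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

round(prop_var, 4)          # per-component proportion of variance
round(cumsum(prop_var), 4)  # cumulative proportion
```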
names_for_the_biplot <- round(summary(human_data_PCA_scaled)$importance[2, ] * 100, digits = 2)
names_for_the_biplot <- paste0(names(names_for_the_biplot), " (", names_for_the_biplot, "%)")
biplot(human_data_PCA_scaled, choices = 1:2, xlab = names_for_the_biplot[1], ylab = names_for_the_biplot[2], cex = c(0.5, 0.7))
These results make a lot more sense and are also in concordance with what I originally concluded from the first graphical analysis:
- PC1 decreases with GNI, life expectancy, education and the ratio of educated females - all positively correlated with each other - while increases in maternal mortality and adolescent births increase PC1.
- PC2 increases with the number of female representatives and the female share of the labour force, which are correlated with each other. As mentioned above, the differences from the raw analysis are due to GNI having a much larger influence without scaling; with scaling we can actually see what is happening.
5.2.4 My interpretation of the data.
PC1 is related to what is usually described as the development level of a nation. Developed nations have high income, better education and life expectancy, while developing countries and failed states have high adolescent pregnancy rates and maternal mortality. It is widely believed that female education is the main driving factor in the decrease of fertility rates.
The second PC shows a connection between female representation in government and overall inclusion in the workforce, but it is not strictly tied to the "PC1 development level", meaning that in poorer countries women can also participate in government and the workforce. Conversely, in the petrocracies of the Middle East, women have lower representation despite high GNI and its associated parameters.
5.2.5 Exploring the tea data and analysis
library(FactoMineR)
tea_data <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# View(tea_data)
str(tea_data)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Everything is a factor apart from age. We can choose which variables to look at; it seems more interesting to choose factual things rather than emotional responses. The first 18 columns are the ones to choose from.
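The factor columns can also be picked out programmatically rather than by eye; a minimal sketch on toy data (my own, not the tea dataset):

```r
# Identify the factor columns of a data frame programmatically
toy <- data.frame(a = factor(c("x", "y")),
                  b = 1:2,
                  c = factor(c("u", "v")))

factor_cols <- names(toy)[sapply(toy, is.factor)]
factor_cols  # "a" "c"
```

Applied to tea_data, the same idiom would confirm that every column except age is a factor.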
summary(tea_data)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
dim(tea_data)
## [1] 300 36
keep_tea_columns <- c("Tea", "How", "how", "sex", "frequency","age_Q")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea_data, all_of(keep_tea_columns))
I chose these variables to look at.
ggpairs(tea_time,progress = F)
summary(tea_time)
## Tea How how sex frequency
## black : 74 alone:195 tea bag :170 F:178 +2/day :127
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 M:122 1 to 2/week: 44
## green : 33 milk : 63 unpackaged : 36 1/day : 95
## other: 9 3 to 6/week: 34
##
## age_Q
## +60 :38
## 15-24:92
## 25-34:69
## 35-44:40
## 45-59:61
People like Earl Grey. There are preferred methods of drinking tea.
mca <- MCA(tea_time)
summary(mca)
##
## Call:
## MCA(X = tea_time)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.262 0.241 0.210 0.196 0.194 0.184 0.174
## % of var. 10.460 9.638 8.416 7.858 7.744 7.375 6.950
## Cumulative % of var. 10.460 20.098 28.514 36.372 44.115 51.490 58.440
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.169 0.152 0.143 0.142 0.122 0.112 0.108
## % of var. 6.777 6.069 5.723 5.674 4.894 4.493 4.324
## Cumulative % of var. 65.217 71.287 77.010 82.683 87.578 92.070 96.395
## Dim.15
## Variance 0.090
## % of var. 3.605
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.522 0.347 0.113 | 0.206 0.059 0.018 | -0.471
## 2 | 0.406 0.210 0.069 | -0.092 0.012 0.004 | -0.982
## 3 | -0.227 0.066 0.040 | -0.383 0.203 0.112 | -0.064
## 4 | -0.534 0.363 0.221 | 0.338 0.158 0.088 | -0.377
## 5 | -0.029 0.001 0.001 | 0.058 0.005 0.002 | -0.140
## 6 | -0.534 0.363 0.221 | 0.338 0.158 0.088 | -0.377
## 7 | 0.113 0.016 0.004 | 0.120 0.020 0.005 | 0.134
## 8 | 0.486 0.302 0.069 | -0.258 0.092 0.019 | -0.746
## 9 | 0.165 0.035 0.010 | -0.072 0.007 0.002 | -0.257
## 10 | 0.576 0.422 0.132 | -0.267 0.099 0.028 | 0.273
## ctr cos2
## 1 0.352 0.092 |
## 2 1.527 0.403 |
## 3 0.007 0.003 |
## 4 0.225 0.110 |
## 5 0.031 0.014 |
## 6 0.225 0.110 |
## 7 0.028 0.006 |
## 8 0.881 0.162 |
## 9 0.105 0.025 |
## 10 0.118 0.030 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.178 21.809 0.454 11.654 | -0.433 3.192 0.061
## Earl Grey | -0.530 11.509 0.506 -12.304 | 0.021 0.020 0.001
## green | 0.457 1.467 0.026 2.780 | 0.845 5.427 0.088
## alone | -0.198 1.620 0.073 -4.659 | 0.014 0.009 0.000
## lemon | 0.169 0.200 0.004 1.027 | 0.272 0.563 0.009
## milk | 0.251 0.841 0.017 2.235 | 0.136 0.270 0.005
## other | 1.910 6.974 0.113 5.808 | -2.256 10.559 0.157
## tea bag | -0.146 0.771 0.028 -2.889 | 0.007 0.002 0.000
## tea bag+unpackaged | -0.194 0.749 0.017 -2.261 | -0.198 0.849 0.018
## unpackaged | 1.195 10.929 0.195 7.633 | 0.485 1.953 0.032
## v.test Dim.3 ctr cos2 v.test
## black -4.280 | -0.072 0.102 0.002 -0.715 |
## Earl Grey 0.498 | -0.004 0.001 0.000 -0.094 |
## green 5.134 | 0.186 0.301 0.004 1.129 |
## alone 0.330 | 0.328 5.524 0.199 7.718 |
## lemon 1.654 | 0.582 2.951 0.042 3.538 |
## milk 1.216 | -1.200 23.938 0.383 -10.694 |
## other -6.859 | -0.834 1.651 0.021 -2.535 |
## tea bag 0.133 | -0.563 14.221 0.414 -11.130 |
## tea bag+unpackaged -2.312 | 0.555 7.642 0.140 6.481 |
## unpackaged 3.097 | 1.209 13.897 0.199 7.720 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.546 0.125 0.005 |
## How | 0.151 0.165 0.430 |
## how | 0.195 0.041 0.451 |
## sex | 0.089 0.407 0.011 |
## frequency | 0.021 0.286 0.270 |
## age_Q | 0.567 0.422 0.096 |
plot(mca, graph.type="ggplot", invisible=c("var","quali.sup","quanti.sup"),cex=0.8)
plot(mca, graph.type="ggplot", invisible=c("ind"), cex=0.8, habillage="quali")
Looking at the plot we can assume that there is an association between age and tea preference:
- 15-24 prefer Earl Grey.
- The oldest group prefers black tea.
- 25-34 prefer green tea.
- Drinking tea plain is a younger trait; older drinkers prefer tea with something added.
- Middle age is associated with drinking green tea from tea bags daily.
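To back these visual impressions with numbers, one could describe the MCA dimensions directly. A minimal sketch, assuming the FactoMineR package (which provides `MCA()`) is attached and `mca` is the object fitted above:

```r
# Sketch: dimdesc() lists the variables and categories most strongly
# associated with each dimension, which can confirm the age/tea patterns
# read off the biplot. Assumes FactoMineR is attached.
dims <- dimdesc(mca, axes = 1:2)
dims$`Dim 1`$category  # categories significantly linked to dimension 1
```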
6: Analysis of longitudinal data
Loading libraries
library(GGally) # For ggpairs-style plots
library(ggplot2) # For graphical representations
library(caret) # For data splitting and preprocessing
library(corrplot)
library(lme4)
6.1: Data wrangling
The wrangling was performed per the instructions; the resulting R script with comments can be found here.
6.2.1 Implementing the analyses for the rats data
First we read the data created previously:
rats_long <- read_csv("/home/alex/ods_course/IODS-project_R/data/rats_long.csv")
## Rows: 176 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): WD
## dbl (4): ID, Group, Weight, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rats_long <- rats_long %>% mutate(ID = factor(ID), Group = factor(Group))
Then we look at the data:
summary(rats_long)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
glimpse(rats_long)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD8", "WD15", "WD22", "WD29", "WD36", "WD43", "WD44", "…
## $ Weight <dbl> 240, 250, 255, 260, 262, 258, 266, 266, 265, 272, 278, 225, 230…
## $ Time <dbl> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
It would be nice to see the summary per group, since there are three:
rats_long %>%
  group_by(Group) %>%
  filter(Time == 1) %>%
  summarise(
    Mean_Weight = mean(Weight, na.rm = TRUE),
    sd_weight = sd(Weight, na.rm = TRUE),
    weight_count = n()
  )
## # A tibble: 3 × 4
## Group Mean_Weight sd_weight weight_count
## <fct> <dbl> <dbl> <int>
## 1 1 251. 15.2 8
## 2 2 454. 69.8 4
## 3 3 509. 27.8 4
For whatever reason the first group has twice the number of rats and the lowest weight and sd. That is not good; it seems to come from a different experiment, and I am not sure the groups are really comparable. If we plot the weight trajectories:
rats_long %>%
  group_by(Group, Time) %>%
  summarise(
    Mean_Weight = mean(Weight, na.rm = TRUE),
    se_weight = sd(Weight, na.rm = TRUE) / sqrt(n())
  ) %>%
  ungroup() %>%
  ggplot(aes(x = Time, y = Mean_Weight, colour = Group)) +
  geom_line() +
  geom_errorbar(aes(ymin = Mean_Weight - se_weight, ymax = Mean_Weight + se_weight), width = 0.3, linetype = 2)
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
The groups had very different starting points and different growth curves. We can plot each rat separately to look for outliers:
ggplot(rats_long, aes(x = Time, y = Weight, colour = ID)) +
  geom_line(aes(group = ID)) +
  facet_wrap(~Group)
As we can see, each group has a potential outlier: the lowest rat in the 1st group, the highest in the second, and the lowest again in the third. We can standardise the weights within each group to see this more clearly:
rats_long %>%
  group_by(Group) %>%
  mutate(Weight_scaled = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup() %>%
  ggplot(aes(x = Time, y = Weight_scaled)) +
  geom_line(aes(group = ID)) +
  geom_text(aes(label = ID)) +
  facet_wrap(~Group)
Rats 2, 12 and 13 are likely outliers. Rat 2 is too small and never catches up, rat 12 is too large and keeps growing, and rat 13 starts small but grows faster than the others. I would check the original data to see whether something was wrong with these rats. We can remove the outliers for further analysis.
rats_long_no_outliers <- rats_long %>% filter(ID != "2" & ID != "12" & ID != "13")
rats_long_no_outliers %>%
  group_by(Group) %>%
  mutate(Weight_scaled = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup() %>%
  ggplot(aes(x = Time, y = Weight_scaled)) +
  geom_line(aes(group = ID)) +
  geom_text(aes(label = ID)) +
  facet_wrap(~Group)
Much nicer; I will use these data next. It is clear without a test that the weights are different. However, we are interested in the weight gain, not the weights themselves, so we should compare the percentage of weight gained between the groups. We can add a new variable called percent_change: the percentage of body weight gained compared to the previous week.
rats_long_no_outliers_with_percentage_gained <- rats_long_no_outliers %>%
arrange(ID, Time) %>%
group_by(ID) %>%
mutate(
weight_previous = lag(Weight),
percent_change = (Weight - weight_previous) / weight_previous * 100 # Calculate percentage change
) %>%
replace_na(list(percent_change = 0))
We can plot the weight change:
rats_long_no_outliers_with_percentage_gained %>% ggplot(., aes(x = Time, y = percent_change,colour = ID)) +
geom_line(aes(group = ID))+
geom_text(aes(label = ID))+
facet_wrap(~Group)
rats_long_no_outliers_with_percentage_gained <- rats_long_no_outliers_with_percentage_gained %>%
arrange(ID, Time) %>% # Sort by individual and then by timepoint
group_by(ID) %>% # Group by individual
mutate(
starting_weight = dplyr::first(Weight) # Get the starting weight for each individual
)
These plots show us that the 2nd diet seems to be the most reliable, as the rats grow without much variation. The first and third diets have a larger variance in weight gain, which might not be desirable.
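The reliability claim can be quantified. A small sketch using the percent_change data frame built above:

```r
# Sketch: summarise the mean and spread of the weekly percentage gain
# within each diet group; a smaller sd suggests a more consistent diet.
rats_long_no_outliers_with_percentage_gained %>%
  group_by(Group) %>%
  summarise(mean_gain = mean(percent_change),
            sd_gain = sd(percent_change))
```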
We can test the differences pairwise using t-tests. First we check for equal variances with Bartlett's test, and then perform the t-test.
- Groups 1 and 2
r_1_2 <- rats_long_no_outliers_with_percentage_gained %>% filter(Group != 3 )
bartlett.test(r_1_2$percent_change, r_1_2$Group)
##
## Bartlett test of homogeneity of variances
##
## data: r_1_2$percent_change and r_1_2$Group
## Bartlett's K-squared = 5.4216, df = 1, p-value = 0.01989
t.test(percent_change ~ Group, data = r_1_2,var.equal = F)
##
## Welch Two Sample t-test
##
## data: percent_change by Group
## t = -1.5828, df = 86.028, p-value = 0.1171
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.9961404 0.1130061
## sample estimates:
## mean in group 1 mean in group 2
## 0.8305754 1.2721426
- Groups 2 and 3
r_2_3 <- rats_long_no_outliers_with_percentage_gained %>% filter(Group != 1 )
bartlett.test(r_2_3$percent_change, r_2_3$Group)
##
## Bartlett test of homogeneity of variances
##
## data: r_2_3$percent_change and r_2_3$Group
## Bartlett's K-squared = 0.59147, df = 1, p-value = 0.4419
t.test(percent_change ~ Group, data = r_2_3,var.equal = F)
##
## Welch Two Sample t-test
##
## data: percent_change by Group
## t = 2.0699, df = 62.832, p-value = 0.04258
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## 0.02201286 1.25387841
## sample estimates:
## mean in group 2 mean in group 3
## 1.272143 0.634197
- Groups 1 and 3
r_1_3 <- rats_long_no_outliers_with_percentage_gained %>% filter(Group != 2 )
bartlett.test(r_1_3$percent_change, r_1_3$Group)
##
## Bartlett test of homogeneity of variances
##
## data: r_1_3$percent_change and r_1_3$Group
## Bartlett's K-squared = 2.2412, df = 1, p-value = 0.1344
t.test(percent_change ~ Group, data = r_1_3,var.equal = T)
##
## Two Sample t-test
##
## data: percent_change by Group
## t = 0.59432, df = 108, p-value = 0.5535
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -0.4585832 0.8513401
## sample estimates:
## mean in group 1 mean in group 3
## 0.8305754 0.6341970
That means there is only one significant pairwise difference: between groups 2 and 3.
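Because three pairwise tests were run, a multiple-comparison adjustment would also be prudent. As an alternative sketch, base R's `pairwise.t.test()` handles all pairs at once:

```r
# Sketch: pairwise t-tests across all three groups with Bonferroni-adjusted
# p-values; pool.sd = FALSE uses separate variances per pair.
d <- rats_long_no_outliers_with_percentage_gained
pairwise.t.test(d$percent_change, d$Group,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```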
We can also perform an ANOVA to see whether the results agree. It may be wise to include the starting weight in the formula as well.
percent_change_fit <- lm(percent_change ~ Group + starting_weight, data = rats_long_no_outliers_with_percentage_gained)
anova(percent_change_fit)
## Analysis of Variance Table
##
## Response: percent_change
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 7.249 3.6246 1.6935 0.187646
## starting_weight 1 18.200 18.1998 8.5034 0.004134 **
## Residuals 139 297.501 2.1403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
percent_change_fit23 <- lm(percent_change ~ Group + starting_weight, data = r_2_3)
anova(percent_change_fit23)
## Analysis of Variance Table
##
## Response: percent_change
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 6.715 6.7151 4.7383 0.033253 *
## starting_weight 1 11.028 11.0279 7.7816 0.006973 **
## Residuals 63 89.283 1.4172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
percent_change_fit13 <- lm(percent_change ~ Group + starting_weight, data = r_1_3)
anova(percent_change_fit13)
## Analysis of Variance Table
##
## Response: percent_change
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.891 0.8908 0.3617 0.54884
## starting_weight 1 8.850 8.8500 3.5932 0.06071 .
## Residuals 107 263.536 2.4629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA did not show significant differences between the three groups in the percentage of weight gained, but the difference between the 2nd and 3rd group remains. Starting weight is also very important, as I suspected. The results are not really comparable.
6.2.2 Implementing the analyses for BPRS data
Again, we read the data created in the wrangling step:
bprs_long <- read_csv("/home/alex/ods_course/IODS-project_R/data/bprs_long.csv")
## Rows: 360 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): weeks
## dbl (4): treatment, subject, pbrs, week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bprs_long <- bprs_long %>% mutate(treatment = factor(treatment), subject = factor(subject))
summary(bprs_long)
## treatment subject weeks pbrs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
There should be 40 subjects, but there are only 20 subject labels, because the same numbers are reused in both treatment groups. For this analysis we first create a new variable that identifies each person correctly:
bprs_long$human <- paste0(bprs_long$subject,"_",bprs_long$treatment)
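A quick sanity check on the new identifier (a sketch using the objects above):

```r
# Pasting subject and treatment together should yield 40 unique people,
# each observed at the 9 weekly timepoints (weeks 0-8).
length(unique(bprs_long$human))   # 40 people
all(table(bprs_long$human) == 9)  # TRUE: 9 observations each
```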
We can visualise the data first:
bprs_long %>%
  ggplot(aes(x = week, y = pbrs, colour = treatment)) +
  geom_line(aes(group = human))
The visualisation alone does not reveal much, so we move to linear modelling. Since the data come from the same individuals at multiple timepoints, the individual responses are very important.
First we try an ordinary linear model without a random intercept.
bprs_lm_wt <- lm(pbrs ~ week + treatment, data = bprs_long)
summary(bprs_lm_wt)
##
## Call:
## lm(formula = pbrs ~ week + treatment, data = bprs_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Week has a significant effect while treatment does not, but the model is not adequate for repeated measures. We can use a random intercept model instead: each subject gets its own regression intercept, so the observations from one person are no longer assumed to be independent, which is good, because they are not.
bprs_lm_wt__subject <- lmer(pbrs ~ week + treatment + (1 | human), data = bprs_long, REML = FALSE)
# Print the summary of the model
summary(bprs_lm_wt__subject)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pbrs ~ week + treatment + (1 | human)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## human (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: human, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
This model is more suitable for the data. Looking at the random effects, the subject-level intercept variance is about 97 (sd 9.9) and the residual variance is 54. We can check the confidence intervals:
confint(bprs_lm_wt__subject)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 7.918218 12.645375
## .sigma 6.828332 7.973573
## (Intercept) 41.744598 51.163179
## week -2.565919 -1.974914
## treatment2 -5.885196 7.029641
Still no treatment effect, as 0 is inside the confidence interval for treatment2.
We can try to improve the model by adding week as a random slope variable:
bprs_lm_wt__week_subject<-lmer(pbrs ~ week + treatment + (week | human), data = bprs_long, REML = FALSE)
summary(bprs_lm_wt__week_subject)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pbrs ~ week + treatment + (week | human)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## human (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: human, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
The intercept variance of the random effects is much higher and the residual variance is lower. We can check the confidence intervals again. (Note that the chunk below accidentally re-runs confint() on the random intercept model, which is why the intervals match the ones above; the intended call is confint(bprs_lm_wt__week_subject).)
confint(bprs_lm_wt__subject)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 7.918218 12.645375
## .sigma 6.828332 7.973573
## (Intercept) 41.744598 51.163179
## week -2.565919 -1.974914
## treatment2 -5.885196 7.029641
Still no effect of treatment.
anova(bprs_lm_wt__subject,bprs_lm_wt__week_subject)
## Data: bprs_long
## Models:
## bprs_lm_wt__subject: pbrs ~ week + treatment + (1 | human)
## bprs_lm_wt__week_subject: pbrs ~ week + treatment + (week | human)
## npar AIC BIC logLik deviance Chisq Df
## bprs_lm_wt__subject 5 2582.9 2602.3 -1286.5 2572.9
## bprs_lm_wt__week_subject 7 2523.2 2550.4 -1254.6 2509.2 63.663 2
## Pr(>Chisq)
## bprs_lm_wt__subject
## bprs_lm_wt__week_subject 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the AIC and BIC are lower for the more complex model, which means the new model is better. We can add an interaction term to the model; this checks whether the effect of treatment differs by week.
bprs_lm_wt_w_t_week_subject <- lmer(pbrs ~ week + treatment + week*treatment + (week | human), data = bprs_long, REML = FALSE)
summary(bprs_lm_wt_w_t_week_subject)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pbrs ~ week + treatment + week * treatment + (week | human)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## human (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: human, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
anova(bprs_lm_wt__week_subject,bprs_lm_wt_w_t_week_subject)
## Data: bprs_long
## Models:
## bprs_lm_wt__week_subject: pbrs ~ week + treatment + (week | human)
## bprs_lm_wt_w_t_week_subject: pbrs ~ week + treatment + week * treatment + (week | human)
## npar AIC BIC logLik deviance Chisq Df
## bprs_lm_wt__week_subject 7 2523.2 2550.4 -1254.6 2509.2
## bprs_lm_wt_w_t_week_subject 8 2523.5 2554.6 -1253.7 2507.5 1.78 1
## Pr(>Chisq)
## bprs_lm_wt__week_subject
## bprs_lm_wt_w_t_week_subject 0.1821
There is practically no difference between the models. It seems that the two treatments do not differ in their effect on the pbrs outcome, although the score does change depending on the week.
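As a closing check (a sketch, using the random slope model fitted above), one can overlay the fitted trajectories on the observed ones, as is often done with these data:

```r
# Sketch: fitted() gives one model prediction per observation; dashed
# lines show the observed BPRS trajectories, solid lines the fitted ones.
bprs_long$fitted <- fitted(bprs_lm_wt__week_subject)
ggplot(bprs_long, aes(x = week, group = human, colour = treatment)) +
  geom_line(aes(y = pbrs), linetype = 2) +
  geom_line(aes(y = fitted))
```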